Exploratory Data Analysis on Brazil E-Commerce Olist Stores Data

Olist is the largest department store in Brazilian marketplaces. It connects small businesses from all over Brazil to sales channels without hassle and with a single contract. These merchants are able to sell their products through the Olist Store and ship them directly to customers using Olist's logistics partners.

The dataset consists of several .csv files containing information on orders, customers, sellers, feedback, products, geographic location, etc. After a customer purchases a product from the Olist Store, a seller is notified to fulfill that order. Once the customer receives the product, or once the estimated delivery date is due, the customer receives a satisfaction survey by email where they can rate the purchase experience and write some comments. The dataset has been downloaded from Kaggle.

What does Exploratory Data Analysis hold for us?

Nowadays data, which is a kind of information, acts as a resource, and we often dig deep inside it to extract insights. EDA is thus a kind of resource extraction and value addition on our data. Beyond this, another objective of EDA is to explore the unknown terrains of the data and gather useful insights from it.

Below is the outline, as well as the sequential steps taken, to dig into the dataset:

Installing and importing required packages

Among the variety of options for executing the code, from Jupyter Notebook, Kaggle, and Binder to Google Colab, the code can be run with minimal or no changes in just a few clicks of the run button (after making sure all required packages are available).

Since the dataset contains about 89K rows and 41 columns, choosing a platform that provides more RAM for computation is a better option for smoother execution.

Downloading the dataset from Kaggle, as well as committing the notebook, requires API keys, which one can obtain from the respective account credentials.

In the process of EDA we will be using multiple libraries from the Python ecosystem, such as numpy, pandas, seaborn, folium, wordcloud, and plotly. So let's use the Python package installer to install the required packages.

After installing the required libraries, we will import them and use them for the work ahead.
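A minimal sketch of the install and import step (the exact package list depends on your environment; the plotting and mapping libraries are shown commented out since they are only needed for specific charts later):

```python
# Install the libraries used throughout the notebook (run once in a shell):
#   pip install numpy pandas matplotlib seaborn plotly folium wordcloud

import numpy as np
import pandas as pd

# The visualization libraries are imported the same way when needed:
# import matplotlib.pyplot as plt
# import seaborn as sns
# import plotly.express as px
# import folium
# from wordcloud import WordCloud

print(pd.__version__)
```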

Downloading required dataset:

The dataset has been downloaded from Kaggle using the Kaggle account username and API key.
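A sketch of the credential setup, assuming the standard Kaggle CLI workflow; the username and key below are placeholders, and the file is written to a temporary directory for illustration (the CLI actually expects it at ~/.kaggle/kaggle.json):

```python
import json
import os
import tempfile

# kaggle.json holds the account username and API key (placeholders here)
creds = {"username": "your_kaggle_username", "key": "your_api_key"}

cfg_dir = tempfile.mkdtemp()
cfg_path = os.path.join(cfg_dir, "kaggle.json")
with open(cfg_path, "w") as f:
    json.dump(creds, f)
os.chmod(cfg_path, 0o600)  # the CLI warns unless the file is private

# With credentials in place, the dataset can be fetched with the CLI:
#   kaggle datasets download -d olistbr/brazilian-ecommerce --unzip
```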

Data preparation and cleaning of the dataset:

Dataset Schema:

The downloaded dataset contains 9 different csv files. So before performing EDA we should follow the dataset schema to merge all the csv files into one single dataframe, which eases the process of performing different pandas operations.

For example, the order_reviews, order_payments, and order_customer datasets will be merged on the basis of common columns like order_id, and similarly the other dataframes can be merged by following the schema.

Merging the different dataframes:

Instead of working with bits and pieces of dataframes, it is much better to work with one dataframe, which can be done by combining all the dataframes into a single merged dataframe. Then we can perform our analysis.

Let's extract a copy of the data from these columns into a new dataframe df_copy, which we can continue to modify without affecting the original dataframe.
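The merge-then-copy step can be sketched with toy dataframes; the column names follow the Olist schema, but the rows here are purely illustrative:

```python
import pandas as pd

# Tiny stand-ins for the real csv files, sharing the key order_id
orders = pd.DataFrame({"order_id": ["o1", "o2"], "customer_id": ["c1", "c2"]})
payments = pd.DataFrame({"order_id": ["o1", "o2"],
                         "payment_type": ["credit_card", "boleto"]})
reviews = pd.DataFrame({"order_id": ["o1", "o2"], "review_score": [5, 3]})

# Merge on the shared key, following the dataset schema
df = orders.merge(payments, on="order_id").merge(reviews, on="order_id")

# Work on an independent copy so the original merged frame stays untouched
df_copy = df.copy()
print(df_copy.shape)
```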

Data preparation for EDA:

Before drawing useful insights from the dataset, we will check different information like the number of columns and rows, data types, etc. By performing these steps we get a fair estimate of the dataset, on which we build our EDA.

Let's view the dataset's shape, size and columns.
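These quick structural checks look like the following (toy dataframe for illustration):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 25.5],
                   "payment_type": ["credit_card", "boleto"]})

print(df.shape)          # (rows, columns)
print(df.size)           # rows * columns
print(list(df.columns))  # column names
df.info()                # dtypes and non-null counts per column
```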

Since not all columns in the dataset have equal importance for the analysis to be carried out, it is better to drop the columns that are not required.

The final dataset contains 10 float, 5 int and 24 object dtype columns in our dataframe, and from the non-null counts it seems the dataset doesn't contain any null values.

We'll convert the object-type date columns into datetime type using pd.to_datetime.

Let's use pandas' describe method to get statistical summaries of the numerical columns.
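Both steps can be sketched as follows (the column name follows the Olist schema; the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-07-24 20:41:37"],
    "price": [29.99, 118.70],
})

# object -> datetime64, so date arithmetic becomes possible
df["order_purchase_timestamp"] = pd.to_datetime(df["order_purchase_timestamp"])

# Summary statistics (count, mean, std, quartiles) for a numeric column
stats = df["price"].describe()
print(df.dtypes)
print(stats["mean"])
```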

Open-ended questions for exploratory data analysis and visualization:

By looking at the individual columns we can frame open-ended questions, which prove to be the backbone of the EDA, and we will try to answer those questions either with numeric representations or with visualizations.

Let's set the chart style, font size, figure size and figure facecolor using matplotlib. These settings apply only to matplotlib and seaborn charts.

Price

Among the different columns of the dataset, the price column gives an estimate of the amount of money spent by customers on their purchases. So let's explore this column:

Mode of Payment

Exploring this column of the dataframe tells us which mode of payment is most favoured by customers, followed by the other modes. So let's explore this column.

Now we want to separate the labels and counts, which is useful for making the pie chart. So let's convert the dataframe to a dictionary, so that we can access keys and values separately.

With the help of the above information we can make a pie chart.
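The dataframe-to-dictionary step can be sketched as below; the payment labels follow the dataset, the counts are illustrative, and the plt.pie call is shown commented out so the sketch stays plot-free:

```python
import pandas as pd

payments = pd.Series(
    ["credit_card", "credit_card", "boleto", "credit_card", "voucher"],
    name="payment_type",
)

# value_counts gives counts per label; to_dict exposes keys/values separately
counts = payments.value_counts().to_dict()
labels, values = list(counts.keys()), list(counts.values())

# import matplotlib.pyplot as plt
# plt.pie(values, labels=labels, autopct="%1.1f%%")
print(labels, values)
```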

The inferences I could draw from this pie chart are as follows:

This reflects that the majority of customers like to pay via credit card, whereas a very small number pay via debit card. This implies that the e-commerce store needs to keep its card payment infrastructure in really good condition, as anything less could hamper both the store's revenue generation and the customer experience. If the store wants to receive more payments via debit card, it needs offers, coupons or some other kind of endorsement to encourage other modes of payment.

Customer review score out of 5 points

For the overall experience, in exchange for the time spent as well as the goods and services purchased from the Olist store, the customer gives a review score out of 5. So let's dive deep to see what percentage of customers give review scores of 5, 4, 3, 2 and 1.

From this we can infer that, out of all products, the number of products given a 5.0 review score is greater than for any other review score.

Among all the customers who have spent money purchasing goods and services at the Olist store, 56.6% are very happy with the kind of service they got, whereas a considerable 12.77% feel unhappy with the service they received.
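Percentage breakdowns like the one above come from value_counts with normalize=True; a toy sketch:

```python
import pandas as pd

scores = pd.Series([5, 5, 5, 4, 1, 2, 5, 3])  # illustrative review scores

# Share of each score as a percentage, highest count first
pct = scores.value_counts(normalize=True).mul(100).round(2)
print(pct)
```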

Number of installments for payment of charges

From the mode-of-payment column we learned that most customers pay their charges via credit card. For the total amount owed at the Olist store, instead of paying in bulk, customers pay in smaller chunks periodically over a period of time; based on their financial liability they opt for the required number of installments to clear their dues. Hence it could be informative to explore the column "payment_installments".

The inferences we can draw from the above plots are below:

Review Comment

After getting the score for the products purchased, it is necessary to check what kind of review messages or comments were given by consumers. Instead of looking at individual words in the feedback messages, let's create a wordcloud, a visual representation of words, to see the most frequently used words in customer comments.

So let's create the word cloud from the respective column and see the feedback messages in a pictorial way.
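Before handing the text to the wordcloud library, it helps to look at the raw word frequencies; a stdlib sketch of that preprocessing (the Portuguese comments below are illustrative stand-ins for real reviews):

```python
from collections import Counter

# Illustrative review comments (Portuguese, as in the Olist dataset)
comments = [
    "muito bom recomendo",
    "excelente produto recomendo",
    "bom produto",
]

# Lowercase, split into words, and count occurrences
words = " ".join(comments).lower().split()
freq = Counter(words)
print(freq.most_common(3))

# The same frequencies can feed the wordcloud library:
# from wordcloud import WordCloud
# wc = WordCloud(width=800, height=400).generate_from_frequencies(freq)
```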

The words that appear in the wordcloud are not written in US or UK English, so some work might be required to translate them into plain English.

Among the words in the cloud, some, like super, Excelente, Recomendo, bom and muito, represent satisfied customers who would recommend purchasing goods from the Olist store to others. This aligns well with the review score of 5, which constitutes about 53%.

Whereas a few words indicate a bad experience for some customers; on identifying them, those customers can be reached out to, their enquiries addressed, and remedies sought.

Order and confirmation time lag

It is often experienced that there is a time lag between when we place an order and when that order is accepted for further processing (packaging, shipping, logistics assignment, etc.). If the time lag is nominal, the user experience of purchasing from the Olist store is not hampered much; the greater the lag, the greater the delay in the product finally reaching the customer, resulting in more waiting time.

Hence, to make sure that Olist customers remain untouched by this problem, it is recommended to keep an eye on the factors involving this time lag.

Before any kind of algebraic operation on columns containing date and time information, we first convert the required columns to timestamps.

After converting all of the required columns into timestamps, we take the difference of two timestamps and then convert the result into hours.
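The time-lag computation can be sketched as below (column names follow the Olist schema; the timestamps are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "order_purchase_timestamp": ["2017-10-02 10:00:00"],
    "order_approved_at": ["2017-10-02 13:30:00"],
})

# object -> datetime64 for every date column
for col in df.columns:
    df[col] = pd.to_datetime(df[col])

# Timedelta between approval and purchase, expressed in hours
df["lag_hours"] = (
    df["order_approved_at"] - df["order_purchase_timestamp"]
).dt.total_seconds() / 3600
print(df["lag_hours"].iloc[0])
```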

The inferences we can make from these plots are below:

Order status

After getting an estimate of the time lag, it is natural to ask whether the product ordered by the customer finally reached its destination or not. After all, the product should reach its customer with minimal hurdles and delay.

So let's explore the columns which give the required information.

The figure shown above is self-explanatory in terms of the numbers and information it conveys.

Sellers City

As different products are supplied by different sellers coming from different geographical locations, it is useful to know their proportions by city. This can help in maintaining stock in different locations, which can reduce transportation time and cost for customers as well as the company.

So let's dive deep into the above column.

Category of product

After getting several details, including methods of payment, feedback, and seller locations, it is most important to investigate the different categories of products sold by the Olist e-commerce store. Doing so gives a fair estimate of customers' choices and preferences when ordering products to their taste.

While traversing the column names of the dataframe, the column "product_category_name" contains the details of the different product categories. Beyond just getting the category names, we should also explore questions like: where credit card is the dominant mode of payment, which categories of products are ordered most? So let's give a visual representation to this column and jump straight into its content.

Top 15 most purchased product categories

Volume of different categories of products: product_length_cm, product_width_cm, product_height_cm

After getting every product-related detail, like category, seller and zip code, it is necessary to know the volume of the products, a crucial property for shipping via logistics partners. Different logistics partners are assigned based on product volume, and the cost of transportation is also calculated from it. Hence it is crucial information as far as products and their dimensions are concerned.

The dimensions of a product are known from its length, breadth and height, so we can perform algebraic operations on each of these columns and then group products by category; this way we get an estimate of product volume per category.
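That computation can be sketched as follows (toy numbers; the column names follow the product dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "product_category_name": ["cama_mesa_banho", "cama_mesa_banho", "beleza_saude"],
    "product_length_cm": [50, 40, 10],
    "product_width_cm": [30, 20, 5],
    "product_height_cm": [10, 10, 4],
})

# Volume in cubic centimetres, then the mean volume per category
df["volume_cm3"] = (
    df["product_length_cm"] * df["product_width_cm"] * df["product_height_cm"]
)
vol_by_cat = df.groupby("product_category_name")["volume_cm3"].mean()
print(vol_by_cat)
```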

Interesting questions and their answers

After reaching the necessary depth of EDA on every column of the dataset, let's use the analysis done in the cells above to answer some pointed questions:

  1. Which are the top 20 categories of products customers bought by paying through credit card?
  2. Which method of payment do customers from each Brazilian state favour at the Olist store?
  3. Which categories of products have the highest volume (dimensions), or greater than the average product volume?
  4. What are the order counts per month of the year for the given customer orders?
  5. From which state of Brazil do the maximum number of sellers come?
  6. How are customers distributed across the different geographical locations of Brazil?
  7. From which region of Brazil does the Olist store receive the maximum orders?
  8. In which regions of Brazil are customer orders not getting delivered?
  9. People of which region mostly use a credit card as their payment method?

Q1. Which are the top 20 categories of products customers bought by paying through credit card?

The answer to this question revolves around combining the information of two columns, product_category_name and payment_type; combining the information drawn from both columns gives the desired answer.
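The filter-and-count step can be sketched with toy rows (a treemap library such as plotly or squarify would then take these counts):

```python
import pandas as pd

df_copy = pd.DataFrame({
    "product_category_name": ["cama_mesa_banho", "beleza_saude",
                              "cama_mesa_banho", "esporte_lazer"],
    "payment_type": ["credit_card", "credit_card", "credit_card", "boleto"],
})

# Keep only credit-card orders, then count orders per category
top = (
    df_copy.loc[df_copy["payment_type"] == "credit_card", "product_category_name"]
    .value_counts()
    .head(20)
)
print(top)
```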

Let's interpret the result of this treemap: the plot represents the top 20 product categories purchased via credit card; in other words, it shows which kinds of products the most favoured method of payment, credit card, was used to purchase.

Each box in the graph shows one product category, hence 20 categories means 20 rectangular boxes. The dimensions of a box represent the count of that category, so whichever box has greater length and breadth represents a larger count for that category, and the smaller-looking boxes mean smaller counts. Following this interpretation, we can say the category cama_mesa_banho has the largest count, followed by beleza_saude and esporte_lazer.

Q2. Which categories of products have the highest volume (dimensions), or greater than the average product volume?

Getting this information can also help in categorizing products, which can be used for choosing different logistics partners for smoother delivery, as those products may require bigger storage space.

The answer to this question can be obtained by combining the information of two columns, product_category_name and product volume, of the dataframe df_copy.

So first we will segregate the volume information from the dataframe, and then find it per product category.

Looking at this graph, the circle has been divided into different slices. Each slice represents a product category, and the wider the slice, the larger the proportion of that category. The colour of each slice represents the corresponding volume, with the numeric representation shown on the scale. Combining all this information, we can infer the following.

Hence we can say that the categories cama_mesa_banho, utilidades_domesticas and moveis_decoracao, followed by casa_conforto, have the largest volumes, which means larger vehicles are needed for transporting and delivering these product categories.

Q3. How does the number of orders received by the Olist e-commerce store vary across the days and months of the timestamp?

The answer to this question involves exploring columns like shipping_limit_date, price, freight_value, order_status, order_purchase_timestamp, order_approved_at, order_delivered_carrier_date, order_delivered_customer_date and order_estimated_delivery_date. Several pieces of information, like day, month, year, timestamp and other details, have to be extracted at a finer level.

The inferences which can be drawn from this plot can be explained in a tabular manner for better understanding.

So we can say that in 2018 the Olist store was able to generate constant revenue throughout the year, whereas in 2016 the store's revenue generation was varying in nature and not as steady as in 2018. The year 2017, however, followed a steady pattern like 2018.

Q4. What are the order counts per month of the year for the given customer orders?

Knowing which month of the year has been most beneficial for the Olist store, in terms of the maximum number of orders received, plays a crucial role in getting the details of the financial year.

So we will extract the month and year in concatenated form and take the count as a measure of the number of orders received at the e-commerce store.
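Extracting the concatenated month-year and counting orders per period can be sketched as (illustrative timestamps):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(
        ["2017-10-05", "2017-10-20", "2017-11-03", "2018-01-15"]
    )
})

# Month and year concatenated, e.g. '2017-10', then order counts per period
orders["month_year"] = (
    orders["order_purchase_timestamp"].dt.to_period("M").astype(str)
)
monthly_counts = orders["month_year"].value_counts()
print(monthly_counts)
```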

The story narrated by this bar graph is that October 2016 received the fewest orders, around 5K, whereas Oct-Nov 2017 received the maximum, around 9K. The other months of the subsequent years received nearly the same number of orders, with minimal differences between years, i.e. across 2018 the Olist store received roughly the same number of orders each month.

Q5. How do client orders come from different Brazilian regions?

As the Olist store is spread across different geographical locations of Brazil, different locations may or may not contribute the same number of customer orders. So the number of orders the Olist store receives from different geo-locations is analysed in this plot.

But direct mapping from an address or pin code may not directly yield the geographical location. So the first attempt should be to find the latitude and longitude corresponding to the different geo-locations of Brazil.

So we need to pass the URL to Python's requests module and parse the JSON, which gives the names of the regions.

Further, we need to filter the latitude and longitude to get the nomenclature for the different geo-locations of Brazil.

After successful filtration of the different geographical locations, we grab the necessary columns of the dataframe and merge them to get the geo-mapping from br_info and geo_group, corresponding to the latitude and longitude of the dataframe.

We then keep only the required columns of the dataframe, used for producing the particular plots.

The above graph shows the amount of money customers pay for products, varying across the different regions of Brazil. Each marker shows a different geo-region, and wherever the marker size is bigger, people of that region spend more at the store.

Therefore, the story depicted by the above graphs can be explained as follows:

Q6. How do order counts differ between various Brazil locations?

After tracking the amount of money people of different regions of Brazil spend, it is also necessary to know how the number of orders varies, i.e. how many orders people from different regions placed with the Olist e-commerce store, month by month.

The above graphs depict the number of orders coming from different regions of Brazil, spanning the months of the year. The following points can be concluded from them:

Q7. How does the number of customers vary across different cities of Brazil across the years 2016, 2017 and 2018?

To obtain the number of consumers scattered across the various cities of Brazil, 30K rows of data have been sampled from the dataset pool for quick estimation. Proceeding, we first filter order_year for 2016, 2017 and 2018, and get the corresponding latitude and longitude from the dataframe.


Q8. In which parts of Brazil do the maximum number of orders get undelivered, cancelled, or stuck in shipping?

The conclusions which can be drawn from the above plots (for a sample of 30K rows of the dataset) are:

The products which are still in shipped status, possibly because of logistics constraints, are shown in the above map:

The above graph shows the different locations of Brazil where orders are getting delivered (for a sample of 30K rows of the dataset).

Q9. People of which region mostly use a credit card as their payment method?

As the economy varies from one state to its neighbours, this question would interest people who want to know which states' residents seek credit-based financial support for their expenditure, as well as their capability to repay the amount in minimal time.

So by integrating the information of two columns, payment_type as well as …

The answer to this question will be given by grabbing the information from the mode-of-payment column of the dataframe df_copy.

Reviewing the statistics and other measures of the column payment_type, it is clear that, among the many customers who visited the Olist store, people mostly like to pay their bills via credit card over other methods of payment.

Conclusion:

After grabbing the dataset from Kaggle and following the data schema, a final merged version of the existing csv files was obtained. This was followed by open-ended analysis of individual columns of the dataframe, as well as combining the information of two or more columns to draw better insights from the available dataset.

Future work:

Apart from the open-ended questions dealt with by exploring individual columns, as well as combining different columns to explore interesting questions, it is also worth extracting finer details such as:

References